METHOD AND APPARATUS FOR
MULTI-LEVEL DISTRIBUTED SPEECH RECOGNITION



Field Of The Invention 

The invention relates generally to communication devices and 
methods and more particularly to communication devices and 
methods employing speech recognition.

Background Of The Invention 

An emerging area of technology involving terminal devices, such
as handheld devices, mobile phones, laptops, PDAs, Internet
appliances, desktop computers, or other suitable devices, is the application
of information transfer in a plurality of input and output formats.
Typically resident on the terminal device is an input system allowing a
user to enter information, such as a specific information request. For
example, a user may use the terminal device to access a weather
database to obtain weather information for a specific city. Typically,
the user enters a voice command asking for weather information for a
specific location, such as "Weather in Chicago." Due to processing
limitations associated with the terminal device, the voice command
may be forwarded to a network element via a communication link,
wherein the network element is one of a plurality of network elements
within a network. The network element contains a speech recognition
engine that recognizes the voice command and then executes the command and
retrieves the user-requested information. Moreover, the speech
recognition engine may be disposed within the network and operably
coupled to the network element instead of being resident within the
network element, such that the speech recognition engine may be
accessed by multiple network elements.

With the advancement of wireless technology, there has been an
increase in user applications for wireless devices. Many of these
devices have become more interactive, providing the user the ability to
enter command requests and access information. Concurrently, with
the advancement of wireless technology, there has also been an
increase in the forms in which a user may submit a specific information
request. Typically, a user can enter a command request via a keypad,
wherein the terminal device encodes the input and provides it to the
network element. A common example of this system is a telephone
banking system where a user enters an account number and personal
identification number (PIN) to access account information. The
terminal device or a network element, upon receiving input via the
keypad, converts the input to a dual tone multi-frequency (DTMF)
signal and provides the DTMF signal to the banking server.
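As a brief illustration of the keypad encoding mentioned above, the following Python sketch maps each key of a 12-key telephone keypad to its pair of DTMF tone frequencies. The row and column frequencies are the standard DTMF values; the helper names dtmf_pair and encode_digits are hypothetical and are not part of any system described herein.

```python
# Standard DTMF row (low) and column (high) frequencies in Hz.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477)

# Keypad layout, row by row, for a 12-key telephone keypad.
KEYPAD = ("123", "456", "789", "*0#")


def dtmf_pair(key: str) -> tuple[int, int]:
    """Return the (low, high) tone frequencies for one keypad key."""
    for row_index, row in enumerate(KEYPAD):
        col_index = row.find(key)
        if len(key) == 1 and col_index != -1:
            return ROW_HZ[row_index], COL_HZ[col_index]
    raise ValueError(f"not a keypad key: {key!r}")


def encode_digits(digits: str) -> list[tuple[int, int]]:
    """Encode a digit string, such as a PIN, as a sequence of tone pairs."""
    return [dtmf_pair(d) for d in digits]
```

Each key press is thus represented by exactly one low tone and one high tone, which is what allows the receiving server to decode the keypad entry.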

Furthermore, a user may enter a command, such as an
information request, using a voice input. Even with improvements in
speech recognition technology, there are numerous processing and
memory storage requirements that limit speech recognition abilities
within the terminal device. Typically, a speech recognition engine
includes a library of speech models with which to match input speech
commands. Reliable speech recognition often requires a large library,
and therefore a significant amount of memory.
Moreover, as speech recognition capabilities increase, power
consumption requirements also increase, thereby shortening the life
span of a terminal device battery.

The terminal speech recognition engine may be an adaptive
system. The speech recognition engine, while having a smaller library
of recognized commands, is more adaptive and able to understand the
user's distinctive speech pattern, such as tone, inflection, accent, etc.
Therefore, the limited speech recognition library within the terminal is
offset by a higher probability of correct voice recognition.
This system is typically limited to only the most common voice
commands, such as programmed voice activated dialing features
where a user speaks a name and the system automatically dials the
associated number, previously programmed into the terminal.

Another method for voice recognition is providing a full voice
command to the network element. The network speech recognition
engine may provide an increase in speech recognition efficiency due to
the large amount of available memory and reduced concerns regarding
power consumption requirements. On a network element, however,
the speech recognition engine must be accessible by multiple users
who access the multiple network elements; a network speech
recognition engine is therefore limited by not being able to recognize
distinctive speech patterns, such as an accent, etc. As such, network speech
recognition engines may provide a larger vocabulary of voice
recognized commands, but at a lower probability of proper recognition,
due to inherent limitations in handling individual user speech patterns.

Also, recent developments provide for multi-level distributed
speech recognition where a terminal device attempts to recognize a
voice command, and if not recognized within the terminal, the voice
command is encoded and provided to a network speech recognition
engine for a second speech recognition attempt. United States Patent
No. 6,185,535 B1, issued to Hedin et al., discloses a system and
method for voice control of a user interface to service applications.
This system provides step-wise speech recognition where the at least
one network speech recognition engine is only utilized if the terminal
device cannot recognize the voice command. United States Patent No.
6,185,535 only provides a single level of assurance that the audio
command is correctly recognized, either from the terminal speech
recognition engine or the network speech recognition engine.

As such, there is a need for improved communication devices 
that employ speech recognition engines. 



Brief Description Of The Drawings

The invention will be more readily understood with reference to 
the following drawings contained herein. 

FIG. 1 illustrates a prior art wireless system.

FIG. 2 illustrates a block diagram of an apparatus for multi-level
distributed speech recognition in accordance with one
embodiment of the present invention.

FIG. 3 illustrates a flow chart representing a method for multi-level
distributed speech recognition in accordance with one
embodiment of the present invention.

FIG. 4 illustrates a block diagram of a system for multi-level
distributed speech recognition in accordance with one embodiment of
the present invention.

FIG. 5 illustrates a flow chart representing a method for multi-level
distributed speech recognition in accordance with one
embodiment of the present invention.

Detailed Description Of a Preferred Embodiment of The Invention 

Generally, a system and method provides for multi-level 
distributed speech recognition through a terminal speech recognition 
engine, operably coupled to a microphone within an audio subsystem 
of a terminal device, receiving an audio command, such as a voice 
command provided from a user, e.g. "Weather in Chicago," and 
generating at least one terminal recognized audio command, wherein
the at least one terminal recognized audio command has a
corresponding terminal confidence value.

The system and method further includes a network element,
within a network, having at least one network speech recognition
engine operably coupled to the microphone within the terminal,
receiving the audio command and generating at least one network
recognized audio command, wherein the at least one network
recognized audio command has a corresponding network confidence
value.

Moreover, the system and method includes a comparator, a
module implemented in hardware or software that compares the
plurality of recognized audio commands and confidence values. The
comparator is operably coupled to the terminal speech recognition
engine for receiving the terminal-recognized audio commands and the
terminal speech recognition confidence values. The comparator is
further coupled to the network speech recognition engine for receiving
the network-recognized audio commands and the network speech
recognition confidence values. The comparator compares the terminal
speech recognition confidence values and the network speech recognition
confidence values, compiling and sorting the recognized commands by
their corresponding confidence values. In one embodiment, the
comparator provides a weighting factor for the confidence values
based on the specific speech recognition engine, such that confidence
values from a particular speech recognition engine are given greater
weight than other confidence values.

Operably coupled to the comparator is a dialog manager, which
may be a voice browser, an interactive voice response unit (IVR), a
graphical browser, a JAVA®-based application, a software program
application, or other software/hardware applications as recognized by
one skilled in the art. The dialog manager is a module implemented in
either hardware or software that receives, interprets and executes a
command upon the reception of the recognized audio commands. The
dialog manager may provide the comparator with an N-best indicator,
which indicates the number of recognized commands, having the
highest confidence values, to be provided to the dialog manager. The
comparator provides the dialog manager the relevant list of recognized
audio commands and their confidence values, i.e. the N-best
recognized audio commands and their confidence values. Moreover, if
the comparator cannot provide the dialog manager any recognized
audio commands, the comparator provides an error notification to the
dialog manager.

When the dialog manager receives one or more recognized audio 
commands and the corresponding confidence values, the dialog 
manager may utilize additional steps to further restrict the list. For 
example, it may execute the audio command with the highest 
confidence value or present the relevant list to the user, so that the 
user may verify the audio command. Also, in the event the dialog 
manager receives an error notification or none of the recognized audio 
commands have a confidence value above a predetermined minimum 
threshold, the dialog manager provides an error message to the user. 

If the audio command is a request for information from a
content server, the dialog manager accesses the content server and
retrieves encoded information. Operably coupled to the dialog
manager is at least one content server, such as a commercially
available server coupled via an internet, a local resident server via an
intranet, a commercial application server such as a banking system,
or any other suitable content server.

The retrieved encoded information is provided back to the dialog
manager, typically encoded as mark-up language for the dialog
manager to decode, such as hypertext mark-up language (HTML),
wireless mark-up language (WML), extensible mark-up language (XML),
Voice extensible Mark-up Language (VoiceXML), Extensible HyperText
Markup Language (XHTML), or other such mark-up languages.
Thereupon, the encoded information is decoded by the dialog manager
and provided to the user.
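The decoding step described above might be sketched as follows for an XML reply, using Python's standard xml.etree.ElementTree module. The payload layout (the weather, city, forecast and high elements) and the function name decode_weather are assumptions made purely for illustration; the specification does not define a reply format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload a content server might return; the element
# names are assumptions made for this example only.
SAMPLE_REPLY = """
<weather>
  <city>Chicago</city>
  <forecast>Sunny</forecast>
  <high unit="F">72</high>
</weather>
"""


def decode_weather(xml_text: str) -> str:
    """Decode the mark-up into a sentence suitable for speech synthesis."""
    root = ET.fromstring(xml_text.strip())
    city = root.findtext("city", default="unknown city")
    forecast = root.findtext("forecast", default="unavailable")
    high = root.find("high")
    message = f"Weather in {city}: {forecast}"
    if high is not None:
        # Append the temperature only when the server supplied it.
        message += f", high {high.text} degrees {high.get('unit', '')}".rstrip()
    return message + "."
```

The resulting sentence could then be handed to a speech synthesis engine or shown on a display, depending on how the output is provided to the user.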

Thereby, the audio command is distributed between at least two
speech recognition engines which may be disposed on multiple levels,
such as a first speech recognition engine disposed on a terminal
device and a second speech recognition engine disposed on a network.

FIG. 1 illustrates a prior art wireless communication system
100 providing a user 102 access to at least one content server 104 via
a communication link 106 between a terminal 108 and a network
element 110. The network element 110 is one of a plurality of
network elements 110 within a network 112. A user 102 provides an
input command 114, such as a voice command, e.g. "Weather in
Chicago," to the terminal 108. The terminal 108 interprets the
command and provides the command to the network element 110, via
the communication link 106, such as a standard wireless connection.
The network element 110 receives the command, processes the
command, i.e. utilizes a voice recognizer (not shown) to recognize and
interpret the input command 114, and then accesses at least one of a
plurality of content servers 104 to retrieve the requested information.
Once the information is retrieved, it is provided back to the network
element 110. Thereupon, the requested information is provided to the
terminal 108, via communication link 106, and the terminal 108
provides an output 116 to the user, such as an audible message.

In the prior art system of FIG. 1, the input command 114 may
be a voice command provided to the terminal 108. The terminal 108
encodes the voice command and provides the encoded voice command
to the network element 110 via communication link 106. Typically, a
speech recognition engine (not shown) within the network element 110
will attempt to recognize the voice command and thereupon retrieve
the requested information. As discussed above, the voice command
114 may also be interpreted within the terminal 108, whereupon the
terminal then provides the network element 110 with a request for the
requested information.

It is also known within the industry to provide the audio
command 114 to the terminal 108, whereupon the terminal 108 then
attempts to interpret the command. If the terminal 108 should be
unable to interpret the command 114, the audio command 114 is
then provided to the network element 110, via communication link
106, to be recognized by at least one network speech recognition
engine (not shown). This prior art system provides a step-wise voice
recognition system in which the at least one network speech
recognition engine is only accessed if the terminal speech recognition
engine is unable to recognize the voice command.
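The step-wise prior art behavior described above can be sketched in Python as a simple fallback. The two toy engines and the function name stepwise_recognize are hypothetical stand-ins; a real engine would operate on encoded audio rather than on byte strings that already spell the command.

```python
from typing import Callable, Optional

# An engine takes encoded audio and returns recognized text, or None
# when it cannot recognize the command. Both engines are stand-ins.
Engine = Callable[[bytes], Optional[str]]


def stepwise_recognize(audio: bytes, terminal: Engine, network: Engine):
    """Try the terminal engine first; use the network engine only on failure."""
    result = terminal(audio)
    if result is not None:
        return result, "terminal"
    result = network(audio)
    if result is not None:
        return result, "network"
    return None, "none"


# Toy engines: the terminal knows only a small dialing vocabulary.
def toy_terminal(audio: bytes) -> Optional[str]:
    return "call home" if audio == b"call home" else None


def toy_network(audio: bytes) -> Optional[str]:
    return audio.decode("ascii") if audio else None
```

Note that in this arrangement only one engine's result is ever used for a given command, so there is a single level of assurance that the command was recognized correctly.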

FIG. 2 illustrates an apparatus for multi-level distributed 
speech recognition, in accordance with one embodiment of the present 
invention. An audio subsystem 120 is operably coupled to both a first 
speech recognition engine 122 and at least one second speech 
recognition engine 124, such as OpenSpeech recognition engine 1.0, 
manufactured by SpeechWorks International, Inc. of 695 Atlantic 
Avenue, Boston, MA 02111 USA. As recognized by one skilled in the 
art, any other suitable speech recognition engine may be utilized 
herein. The audio subsystem 120 is coupled to the speech recognition 
engines 122 and 124 via connection 126. The first speech recognition 
engine 122 is operably coupled to a comparator 128 via connection 
130 and the second speech recognition engine 124 is also operably coupled to
the comparator 128 via connection 132.

The comparator 128 is coupled to a dialog manager 134 via 
connection 136. The dialog manager 134 is coupled to a content server 138,
via connection 140, and a speech synthesis engine 142 via connection
144. Moreover, the speech synthesis engine 142 is further operably
coupled to the audio subsystem 120 via connection 146.

The operation of the apparatus of FIG. 2 is described with
reference to FIG. 3, which illustrates a method for multi-level
distributed speech recognition, in accordance with one embodiment of
the present invention. The method begins, designated at 150, when
the apparatus receives an audio command, step 152. Typically, the
audio command is provided to the audio subsystem 120. More
specifically, the audio command may be provided via a microphone
(not shown) disposed within the audio subsystem 120. As recognized
by one skilled in the art, the audio command may be provided from
any other suitable means, such as read from a memory location,
provided from an application, etc.

Upon receiving the audio command, the audio subsystem
provides the audio command to the first speech recognition engine
122 and the at least one second speech recognition engine 124,
designated at step 154. The audio command is provided across
connection 126. Next, the first speech recognition engine 122
recognizes the audio command to generate at least one first recognized
audio command, wherein the at least one first recognized audio
command has a corresponding first confidence value, designated at
step 156. Also, the at least one second speech recognition engine
recognizes the audio command to generate at least one second
recognized audio command, wherein the at least one second
recognized audio command has a corresponding second confidence
value, designated at step 158. The at least one second speech
recognition engine recognizes the same audio command as the first
speech recognition engine, but recognizes the audio command
independently of the first speech recognition engine.
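One way to picture steps 154 through 158 is to hand the same audio to both engines and collect their results separately. The Python sketch below does this with a thread pool; the use of threads, the stub engines and the name recognize_independently are assumptions for illustration, since the description only requires that each engine recognize the command independently of the other.

```python
from concurrent.futures import ThreadPoolExecutor


def first_engine(audio: bytes):
    """Stand-in for the first speech recognition engine 122."""
    return [("weather in chicago", 0.80)]


def second_engine(audio: bytes):
    """Stand-in for the second speech recognition engine 124."""
    return [("weather in chicago", 0.84), ("whether in chicago", 0.40)]


def recognize_independently(audio: bytes):
    """Give the same audio to both engines and collect both result lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_future = pool.submit(first_engine, audio)
        second_future = pool.submit(second_engine, audio)
        # Neither engine sees the other's output; results are joined only here.
        return first_future.result(), second_future.result()
```

Each result list pairs a recognized audio command with its confidence value, which is the form in which the lists would be passed on to the comparator.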

The first speech recognition engine 122 then provides the at
least one first recognized audio command to the comparator 128, via
connection 130, and the at least one second speech recognition engine
124 provides the at least one second recognized audio
command to the comparator 128, via connection 132. The
comparator, in one embodiment of the present invention, weights the
at least one first confidence value by a first weight factor and weights
the at least one second confidence value by a second weight factor.
For example, the comparator may give deference to the recognition of
the first speech recognition engine; therefore, the first confidence
values may be multiplied by a scaling factor of 0.95 and the second
confidence values may be multiplied by a scaling factor of 0.90,
designated at step 160.
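The weighting of step 160 amounts to a multiplication per engine. The following Python sketch applies the scaling factors 0.95 and 0.90 given above; the raw confidence values and the command strings are made-up numbers chosen only to show that weighting can change which command ranks highest.

```python
FIRST_WEIGHT = 0.95   # scaling factor for the first engine (from the text)
SECOND_WEIGHT = 0.90  # scaling factor for the second engine (from the text)


def apply_weight(results, weight):
    """Multiply every confidence value in a result list by one weight factor."""
    return [(command, confidence * weight) for command, confidence in results]


# Illustrative raw confidence values; these numbers are made up.
first_raw = [("weather in chicago", 0.80)]
second_raw = [("weather in chico", 0.84)]

first_weighted = apply_weight(first_raw, FIRST_WEIGHT)     # 0.80 * 0.95 = 0.76
second_weighted = apply_weight(second_raw, SECOND_WEIGHT)  # 0.84 * 0.90 = 0.756
```

Before weighting, the second engine's command has the higher confidence value (0.84 versus 0.80); after weighting, the first engine's command ranks higher (0.76 versus 0.756), reflecting the deference given to the first engine.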

Next, the comparator selects at least one recognized audio
command, having a recognized audio command confidence value, from
the at least one first recognized audio command and the at least one
second recognized audio command, based on the at least one first
confidence value and the at least one second confidence value,
designated at step 162. In one embodiment of the present invention,
the dialog manager provides the comparator with an N-best indicator,
indicating the number of requested recognized commands, such as
the five-best recognized commands where the N-best indicator is five.

The dialog manager 134 receives the recognized audio
commands, such as the N-best recognized audio commands, from the
comparator 128 via connection 136. The dialog manager then
executes at least one operation based on the at least one recognized
audio command, designated as step 164. For example, the dialog
manager may seek to verify the at least one recognized audio
command, designated at step 166, by providing the N-best list of
recognized audio commands to the user for user verification. In one
embodiment of the present invention, the dialog manager 134
provides the N-best list of recognized audio commands to the speech
synthesis engine 142, via connection 144. The speech synthesis
engine 142 synthesizes the N-best recognized audio commands and
provides them to the audio subsystem 120, via connection 146,
whereupon the audio subsystem provides the N-best recognized list
to the user.

Moreover, the dialog manager may perform further filtering
operations on the N-best list, such as comparing the at least one
recognized audio command confidence value against a minimum
confidence level, such as 0.65, and then simply designating the
recognized audio command having the highest confidence value as the
proper recognized audio command. The dialog manager then
executes that command, such as accessing a content server 138 via
connection 140 to retrieve requested information, such as weather
information for a particular city.
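The dialog manager's filtering described above might be sketched in Python as follows, using the example minimum confidence level of 0.65. The function name choose_command is hypothetical, and treating a value exactly equal to the minimum as acceptable is an assumption, since the description does not say how a tie with the minimum is handled.

```python
DIALOG_MIN_CONFIDENCE = 0.65  # example minimum confidence level from the text


def choose_command(n_best, minimum=DIALOG_MIN_CONFIDENCE):
    """Return the highest-confidence command at or above the minimum, else None."""
    eligible = [(cmd, conf) for cmd, conf in n_best if conf >= minimum]
    if not eligible:
        return None  # the caller would report an error message to the user
    return max(eligible, key=lambda pair: pair[1])[0]
```

For instance, given the list [("weather in chicago", 0.76), ("weather in chico", 0.60)], only the first entry passes the minimum and it is designated the proper recognized audio command.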

Furthermore, the comparator generates an error notification
when the at least one first confidence value and the at least one
second confidence value are below a minimum confidence level,
designated at step 168. For example, with reference to FIG. 2, the
comparator 128 may have an internal minimum confidence level, such
as 0.55, with which the first confidence values and second confidence
values are compared. If none of the first confidence values or the
second confidence values are above the minimum confidence level, the
comparator issues an error notification to the dialog manager 134, via
connection 176.

Moreover, the dialog manager may issue an error notification in
the event the recognized audio commands, such as within the N-best
recognized audio commands, fail to contain a recognized confidence
value above a dialog manager minimum confidence level. An error
notification is also generated by the comparator when the first speech
recognition engine and the at least one second speech recognition
engine fail to recognize any audio commands, or when the confidence
values of the recognized audio commands are below a minimum confidence level
designated by the first speech recognition engine, the second speech
recognition engine, or the comparator.

When an error notification is issued, either through the
comparator 128 or the dialog manager 134, the dialog manager then
executes an error command, wherein the error command is provided to
the speech synthesis engine 142, via connection 144, and further
provided to the end user via the audio subsystem 120, via connection
146. As recognized by one skilled in the art, the error command may
be provided to the user through any other suitable means, such as
using a visual display.

Thereupon, the apparatus of FIG. 2 provides for multi-level
distributed speech recognition. Once the dialog manager executes an
operation in response to the at least one recognized command, the
method is complete, designated at step 170.

FIG. 4 illustrates a multi-level distributed speech recognition
system, in accordance with one embodiment of the present invention.
The system 200 contains a terminal 202 and a network element
204. As recognized by one skilled in the art, the network element 204
is one of a plurality of network elements 204 within a network 206.

The terminal 202 has an audio subsystem 206 that contains,
among other things, a speaker 208 and a microphone 210. The audio
subsystem 206 is operably coupled to a terminal voice transfer
interface 212. Moreover, a terminal session control 214 is disposed
within the terminal 202.

The terminal 202 also has a terminal speech recognition engine
216, such as found in the Motorola i90c™ which provides voice
activated dialing, manufactured by Motorola, Inc. of 1301 East
Algonquin Road, Schaumburg, Illinois, 60196 USA, operably coupled
to the audio subsystem 206 via connection 218. As recognized by one
skilled in the art, other suitable speech recognition engines may be
utilized herein. The terminal speech recognition engine 216 receives
an audio command 220 originally provided from a user 222, via the
microphone 210 within the audio subsystem 206.

The terminal session control 214 is operably coupled to a
network element session control 222 disposed within the network
element 204. As recognized by one skilled in the art, the terminal
session control 214 and the network element session control 222
communicate upon the initialization of a communication session, for
the duration of the session, and upon the termination of the
communication session, for example by providing address designations
during an initialization start-up for various elements disposed within
the terminal 202 and also the network element 204.

The terminal voice transfer interface 212 is operably coupled to 
a network element voice transfer interface 224, disposed in the 
network element 204. The network element voice transfer interface 
224 is further operably coupled to at least one network speech 
recognition engine 226, such as OpenSpeech recognition engine 1.0, 
manufactured by SpeechWorks International, Inc. of 695 Atlantic 
Avenue, Boston, MA 02111 USA. As recognized by one skilled in the 
art, any other suitable speech recognition engine may be utilized 
herein. The at least one network speech recognition engine 226 is
further coupled to a comparator 228 via connection 230. The
comparator may be implemented in either hardware or software for,
among other things, selecting at least one recognized audio command
from the recognized audio commands received from the terminal
speech recognition engine 216 and the network speech recognition
engine 226.

The comparator 228 is further coupled to the terminal speech
recognition engine 216 disposed within the terminal 202, via
connection 232. The comparator 228 is coupled to a dialog manager
234, via connection 236. The dialog manager 234 is operably coupled to a
plurality of modules: it is coupled to a speech synthesis engine 238, via
connection 240, and coupled to at least one content server 104. As
recognized by one skilled in the art, the dialog manager may be coupled to
a plurality of other components, which have been omitted from FIG. 4
for clarity purposes only.

FIG. 5 illustrates a method for multi-level distributed speech
recognition, in accordance with an embodiment of the present
invention. As noted with reference to FIG. 4, the method of FIG. 5
begins, step 300, when an audio command is received within the terminal
202. Typically, the audio command is provided to the terminal 202
from a user 102 providing an audio input to the microphone 210 of
the audio subsystem 206. The audio input is encoded in a standard
encoding format and provided to the terminal speech recognition engine
216 and further provided to the at least one network speech
recognition engine 226, via the terminal voice transfer interface 212
and the at least one network element voice transfer interface 224,
designated at step 304.

Similar to the apparatus of FIG. 2, the terminal speech
recognition engine recognizes the audio command to generate at least
one terminal recognized audio command, wherein the at least one
terminal recognized audio command has a corresponding terminal
confidence value, designated at step 306. Moreover, the at least one
network speech recognition engine 226 recognizes the audio command
to generate at least one network recognized audio command, wherein
the at least one network recognized audio command has a
corresponding network confidence value, designated at step 308. The
at least one network speech recognition engine 226 recognizes the
same audio command as the terminal speech recognition engine, but
recognizes the audio command independently of the terminal speech
recognition engine.

Once the audio command has been recognized by the terminal
speech recognition engine 216, the at least one terminal recognized
audio command is provided to the comparator 228, via connection
232. Also, once the at least one network speech recognition engine
226 has recognized the audio command, the at least one network
recognized audio command is provided to the comparator 228, via
connection 230.

In one embodiment of the present invention, the comparator
228 weights the at least one terminal confidence value by a terminal
weight factor and weights the at least one network confidence value by
a network weight factor, designated at step 310. For example, the
comparator may grant deference to the recognition capability of the at
least one network speech recognition engine 226 and therefore adjust,
i.e. multiply, the network confidence values by a scaling factor to
increase the network confidence values and also adjust, i.e. multiply,
the terminal confidence values by a scaling factor to reduce the
terminal confidence values.

Moreover, the method provides for selecting at least one
recognized audio command having a recognized audio command
confidence value from the at least one terminal recognized audio
command and the at least one network recognized audio command,
designated at step 312. Specifically, the comparator 228 selects a
plurality of recognized audio commands based on the recognized
audio command confidence value. In one embodiment of the present
invention, the dialog manager 234 provides the comparator 228 with
an N-best indicator, indicating the number N of recognized audio
commands to provide to the dialog manager 234. The comparator 228
sorts the at least one terminal recognized audio command and at least
one network recognized audio command by their corresponding
confidence values and extracts the N-best commands therefrom.

In one embodiment of the present invention, the comparator
228 may filter the at least one terminal recognized audio command
and at least one network recognized audio command based on the
confidence values corresponding to the recognized audio commands. For
example, the comparator may have a minimum confidence level with
which the recognized audio command confidence values are compared,
and all recognized audio commands having a confidence value below
the minimum confidence level are eliminated. Thereupon, the
comparator provides the dialog manager with the N-best commands.

Moreover, the comparator may provide the dialog manager with
fewer than N commands in the event that there are fewer than N
commands having a confidence value above the minimum confidence
level. In the event the comparator fails to receive any recognized
commands having a confidence value above the minimum confidence
level, the comparator generates an error notification and this error
notification is provided to the dialog manager via connection 236.
Furthermore, an error notification is generated when the at least one
terminal confidence value and the at least one network confidence
value are below a minimum confidence level, such as a confidence
level below 0.5, designated at step 314.
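The comparator behavior of steps 312 and 314 can be sketched in Python as a merge, a filter against the minimum confidence level, a sort, and a cut to N entries. The function name select_n_best and the dictionary returned are hypothetical, as is the choice to keep duplicate commands from the two engines as separate entries; the description does not say how duplicates are treated.

```python
def select_n_best(terminal_results, network_results, n, minimum=0.5):
    """Merge both result lists, drop low-confidence entries, keep the top n."""
    merged = list(terminal_results) + list(network_results)
    kept = [(cmd, conf) for cmd, conf in merged if conf >= minimum]
    if not kept:
        # Nothing reached the minimum confidence level: error notification.
        return {"error": True, "commands": []}
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # Slicing returns fewer than n entries when fewer than n survive.
    return {"error": False, "commands": kept[:n]}
```

With an N-best indicator of five and only three commands above the minimum, the dialog manager would receive three commands, consistent with the fewer-than-N case described above.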

In one embodiment of the present invention, the dialog manager
may verify the at least one recognized audio command to generate a
verified recognized audio command and execute an operation based
on the verified recognized audio command, designated at step 316.
For example, the dialog manager may provide the list of N-best
recognized audio commands to the user through the speaker 208, via
the voice transfer interfaces 212 and 224 and the speech synthesis
engine 238, whereupon the user may then select which of the N-best
commands accurately reflects the original audio command, generating
a verified recognized audio command.

This verified recognized audio command is then provided back
to the dialog manager 234 in the same manner the original audio
command was provided. For example, should the fourth recognized
audio command of the N-best list be the proper command, the user
may verify this command by speaking the word "4" into the
microphone 206. This spoken selection is provided to both the
terminal speech recognition engine 216 and the at least one network
speech recognition engine 226, and is further provided to the
comparator 228, whereupon it is provided to the dialog manager 234
as the verified recognized audio command. The dialog manager 234,
upon receiving the verified recognized audio command, executes an
operation based on this verified recognized audio command.
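
The verification step described above can be illustrated with a
short Python sketch. It assumes the user's spoken selection has
already passed through the recognition engines and the comparator
and arrives as a text string such as "4". The function names
build_prompt and verify_selection and the 1-based numbering of the
list are hypothetical choices made for the example.

```python
from typing import List, Optional


def build_prompt(n_best: List[str]) -> str:
    """Text that could be handed to a speech synthesis engine."""
    return " ".join(f"{i}: {cmd}." for i, cmd in enumerate(n_best, start=1))


def verify_selection(n_best: List[str], recognized_choice: str) -> Optional[str]:
    """Map a recognized selection such as "4" onto the N-best list.

    Returns the verified recognized audio command, or None when the
    selection is not a number within the list.
    """
    try:
        index = int(recognized_choice.strip())
    except ValueError:
        return None  # the selection was not recognized as a number
    if 1 <= index <= len(n_best):
        return n_best[index - 1]
    return None
```

For instance, with a four-entry list, a recognized selection of "4"
yields the fourth entry, while "7" yields no verified command.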

The dialog manager 234 may execute a plurality of operations
based on the at least one recognized audio command, or the verified
audio command. For example, the dialog manager may access a
content server 104, such as a commercial database, to retrieve
requested information. Moreover, the dialog manager may execute an
operation within a program, such as going to the next step of a
preprogrammed application. Also, the dialog manager may fill in a
form with the recognized audio command and thereupon request from
the user a next entry or input for the form. As recognized by one
skilled in the art, the dialog manager may perform any suitable
operation as directed by, or upon the reception of, the at least one
recognized audio command.

In one embodiment of the present invention, the dialog manager
may, upon receiving the at least one recognized audio command, filter
the at least one recognized audio command based on the at least one
recognized audio command confidence value and execute an operation
based on the recognized audio command having the highest
recognized audio command confidence value, designated at step 318.
For example, the dialog manager may eliminate all recognized audio
commands having a confidence value below a predetermined setting,
such as below 0.6, and then execute an operation based on the
remaining recognized audio commands. As noted above, the dialog
manager may execute any suitable executable operation in response
to the at least one recognized audio command.

Moreover, the dialog manager may, based on the filtering, seek
to eliminate any recognized audio command having a confidence value
below a predetermined confidence level, similar to the operation
performed by the comparator 228. For example, the dialog manager
may set a higher minimum confidence value than the comparator, as
this minimum confidence level may be set by the dialog manager 234
independent of the rest of the system 200. In the event the dialog
manager should, after filtering, fail to retain any recognized audio
commands above the dialog manager minimum confidence level, the
dialog manager 234 thereupon generates an error notification, similar
to the comparator 228.

Once the error notification has been generated, the dialog
manager 234 executes an error command to notify the user 102 that
the audio command was not properly received. As recognized by one
skilled in the art, the dialog manager may simply execute the error 
command instead of generating the error notification as performed by 
the comparator 228. 
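
The dialog manager filtering and error handling described above
can be sketched as follows. This is an illustrative Python outline
only. The function dialog_manager_step and the callbacks execute
and notify_error are hypothetical names, and the 0.6 default is the
example value given in the text.

```python
from typing import Callable, List, Optional, Tuple

# (recognized audio command, confidence value)
Command = Tuple[str, float]


def dialog_manager_step(commands: List[Command],
                        execute: Callable[[str], None],
                        notify_error: Callable[[], None],
                        min_confidence: float = 0.6) -> Optional[str]:
    """Filter by the dialog manager's own minimum, then act on the best."""
    # Eliminate every command whose confidence is below the setting.
    remaining = [c for c in commands if c[1] >= min_confidence]
    if not remaining:
        # Nothing survived filtering: execute the error command instead.
        notify_error()
        return None
    # Act on the command with the highest confidence value.
    best_text, _ = max(remaining, key=lambda c: c[1])
    execute(best_text)
    return best_text
```

Because min_confidence is a parameter of the dialog manager alone,
it can be set higher than the comparator's minimum without changing
any other part of the system.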

Once the dialog manager has fully executed the operation, the 
method for multi-level distributed recognition has been completed, 
designated at step 320. 

The present invention is directed to multi-level distributed
speech recognition through a first speech recognition engine and at
least one second speech recognition engine. In one embodiment of the
present invention, the first speech recognition engine is disposed
within a terminal and the at least one second speech recognition
engine is disposed within a network. As recognized by one skilled in
the art, the speech recognition engines may be disposed within the
terminal, within the network element, in a separate server on the
network operably coupled to the network element, etc., wherein the
speech recognition engines receive the audio command and provide at
least one recognized audio command to be compared and provided to a
dialog manager. Moreover, the present invention improves over the
prior art by providing the audio command to the second speech
recognition engine, independent of the same command being provided
to the first speech recognition engine. Therefore, irrespective of the
recognition capabilities of the first speech recognition engine, the
same audio command is further provided to the second speech
recognition engine. As such, the present invention improves the
reliability of speech recognition through the utilization of multiple
speech recognition engines in conjunction with a comparator and
dialog manager that receive and further refine the accuracy of the
speech recognition capabilities of the system and method.

It should be understood that the implementation of other
variations and modifications of the invention and its various aspects
will be readily apparent to those of ordinary skill in the art, and
that the invention is not limited by the specific embodiments
described herein. For example, the comparator and dialog manager of
FIG. 4 may be disposed on a server coupled to the network element
instead of being resident within the network element. It is therefore
contemplated that the present invention covers any and all
modifications, variations, or equivalents that fall within the spirit and
scope of the basic underlying principles disclosed and claimed herein.